Empirical performance evaluation of page segmentation algorithms
Abstract
Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) systems. While numerous segmentation algorithms have been proposed, there is relatively little literature on comparative evaluation, empirical or theoretical, of these algorithms. We use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: 1) we first create mutually exclusive training and test datasets with groundtruth, 2) we then select a meaningful and computable performance metric, 3) an optimization procedure is then used to automatically search for the optimal parameter values of the segmentation algorithms, 4) the segmentation algorithms are then evaluated on the test dataset, and finally 5) a statistical error analysis is performed to give the statistical significance of the experimental results. We apply this methodology to five segmentation algorithms, three of which are representative research algorithms and the other two of which are well-known commercial products. The three research algorithms evaluated are Nagy's X-Y cut, O'Gorman's Docstrum and Kise's Voronoi-diagram-based algorithm. The two commercial products evaluated are Caere Corporation's segmentation algorithm and ScanSoft Corporation's segmentation algorithm. The evaluations are conducted on 978 images from the University of Washington III dataset. It is found that the performances of the Voronoi-based, Docstrum and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which in turn is significantly better than that of the X-Y cut algorithm. Furthermore, we see that the commercial segmentation algorithms and the research segmentation algorithms have comparable performance.
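The five steps of the methodology can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the hypothetical `page_error(page, params)` callable (assumed to run one algorithm on one page and score it against groundtruth), the exhaustive search over a candidate list, and the paired t statistic are all assumptions chosen for illustration; the abstract does not specify which metric, optimizer, or statistical test is used.

```python
import random
import statistics
from math import sqrt


def split_dataset(pages, train_fraction=0.5, seed=0):
    """Step 1: split pages into mutually exclusive training and test sets."""
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def per_page_errors(page_error, params, pages):
    """Steps 2 and 4: apply the chosen metric to every page in a set."""
    return [page_error(page, params) for page in pages]


def train(page_error, candidates, train_pages):
    """Step 3: choose the parameter setting with the lowest mean training error.

    Exhaustive search over a candidate list stands in for whatever
    optimization procedure is actually used.
    """
    return min(
        candidates,
        key=lambda p: statistics.mean(per_page_errors(page_error, p, train_pages)),
    )


def paired_t_statistic(errors_a, errors_b):
    """Step 5: paired t statistic on per-page error differences.

    Assumes both lists cover the same pages in the same order and that
    the differences are not all identical (otherwise the standard
    deviation is zero and the statistic is undefined).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / sqrt(len(diffs)))
```

In use, each algorithm would be trained on the training pages, its per-page errors collected on the test pages with the selected parameters, and pairs of algorithms compared with the statistic from step 5. Keeping the errors per page, rather than only their mean, is what makes a paired comparison possible.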
Similar resources
Software Architecture of Pset: a Page Segmentation Evaluation Toolkit
Empirical performance evaluation of page segmentation algorithms has become increasingly important due to the numerous algorithms that are being proposed each year. In order to choose between these algorithms for a specific domain it is important to empirically evaluate their performance. To accomplish this task the document image analysis community needs i) standardized document image datasets ...
Segmentation Evaluation
Empirical performance evaluation of page segmentation algorithms has become increasingly important due to the numerous algorithms that are being proposed each year. In order to choose between these algorithms for a specific domain it is important to empirically evaluate their performance. To accomplish this task the document image analysis community needs i) standardized document image datasets ...
A Methodology for Empirical Performance Evaluation of Page Segmentation Algorithms
Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) systems. While numerous page segmentation algorithms have been proposed, there is relatively little literature on comparative evaluation, empirical or theoretical, of these algorithms. For the existing performance evaluation methods, two crucial components are usually missing: 1) automatic trainin...
PSET: A Page Segmentation Evaluation Toolkit
Empirical performance evaluation of page segmentation algorithms has become increasingly important due to the numerous algorithms that are being proposed each year. In order to choose between these algorithms for a specific domain it is important to empirically evaluate their performance. To accomplish this task the document image analysis community needs i) standardized document image datasets...
Empirical Performance Evaluation Methodology and Its Application to Page Segmentation Algorithms
While numerous page segmentation algorithms have been proposed in the literature, there is a lack of comparative evaluation, empirical or theoretical, of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: 1) automatic training of algorithms with free parameters and 2) statistical and error analysis of experimental results. In this paper, w...